Abstract An increasing number of methods for background subtraction
use Robust PCA to identify sparse foreground objects. While many
algorithms use the $\ell_1$-norm as a convex relaxation of the ideal
sparsifying function, we approach the problem with a smoothed
$\ell_p$-norm and present pROST, a method for robust online subspace
tracking. The algorithm is based on alternating minimization on
manifolds. Implemented on a graphics processing unit, it achieves
realtime performance. Experimental results on a state-of-the-art
benchmark for background subtraction on real-world video data indicate
that the method succeeds at a broad variety of background subtraction
scenarios, and it outperforms competing approaches when video quality
is deteriorated by camera jitter.

Keywords Background Subtraction • Robust PCA • Online 
Subspace Tracking • CUDA 



1 Introduction 

Many high-level computer vision tasks like object tracking, activity
recognition and camera surveillance rely on a pixel-level segmentation
of scenes into foreground and background as a preprocessing step. This
task is often referred to as background subtraction and has drawn great
attention in recent years. Surveying the multitude of existing methods
is beyond the scope of this article, and for this purpose we refer to
two excellent recent surveys of the field, [4] and [14].

Robust Principal Component Analysis algorithms have 
been proven successful at separating foreground objects from 
a static or dynamic background [11]. The underlying assumption
of Robust PCA is that the analyzed data can be
considered a superposition of a low-rank and a sparse com- 
ponent, which can be separated blindly without any further 
assumptions on the data. For many video sequences this as- 
sumption holds true. The vectorized frames of a video back- 
ground span a low-dimensional subspace, whereas rapidly 
moving objects appear sparse in space and time and thus can 
be distinguished from the background using Robust PCA. 
Most Robust PCA algorithms focus on processing the com- 
plete data set at once in a batch-processing manner. This 
means that all frames of the video sequence and their statis- 
tics are available and the algorithm performs background 
subtraction on the entire sequence. Recently, methods have 
been presented which allow for online subspace tracking O, 
i.e. video data can be processed frame by frame and each 
new incoming data sample contributes to the estimate of the 
background. 

This paper introduces a robust online background subtraction
algorithm, called pROST: a smoothed $\ell_p$-norm Robust Online
Subspace Tracking method. The name reflects the two defining
characteristics of the algorithm. Firstly, to achieve robustness
against outliers we use a smoothed and weighted $\ell_p$-(pseudo)-norm
cost function. Secondly, an efficient alternating online optimization
framework for estimating the subspace makes the algorithm suitable for
online subspace tracking. The algorithm is tailored for real-time
background subtraction in streaming video and makes use of the
spatio-temporal dependencies between pixel labels, i.e. the foreground
or background assignment on a pixel level. This leads to especially
good performance in videos that require bootstrapping, which means
learning a new background from corrupted data. It also alleviates
problems with large foreground objects, which often arise in PCA-based
methods [4]. In comparison to other methods we observe that our
algorithm is particularly good at dealing with the varying background
in video recorded by jittery cameras.

One of the main difficulties with comparing different background
subtraction methods has been the lack of an accepted benchmark. Various
data sets exist (e.g. [18] and [22]), which provide video sequences and
manually segmented test images. However, the lack of pixel-level ground
truth has led to rather selective evaluation instead of comparable and
representative results, as the authors of the SABS dataset criticize
[5]. They overcome the cumbersome task of hand-segmenting video
sequences by providing an artificially rendered scene, which allows a
very detailed and precise segmentation. Although the animations are
close to photo-realistic, the visual impression is fundamentally
different from true recordings.

In order to establish a benchmark on real-world video sequences, the
changedetection.net dataset [10] has been introduced. It provides image
sequences and full ground truth for a variety of categories such as
static and dynamic background, thermal imaging and camera jitter, as
well as the explicit distinction between foreground objects and their
shadows. Seven statistical error measures are computed to evaluate the
performance in as much detail as possible. This prevents tuning a
method for a single performance measure and guarantees meaningful
scores. The evaluation, and thus the ranking of all competing methods,
is computed per category and as an overall average. All reported
results are conveniently accessible on the project website.
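To make the seven measures concrete, the following minimal sketch computes them from pixel counts of true/false positives and negatives. The measure names are taken from the column headers of Tables 1 and 2; the formulas are the conventional definitions and are an assumption here, chosen so that Recall and FNR sum to one as they do in the reported tables. The function name and the example counts are hypothetical.

```python
def change_detection_scores(tp, fp, tn, fn):
    """Return the seven per-pixel measures from confusion counts.

    The formulas are the conventional definitions (an assumption, not
    quoted from the benchmark's evaluation tool).
    """
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)                      # false positive rate
    fnr = fn / (tp + fn)                      # false negative rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)  # percentage of wrong classifications
    precision = tp / (tp + fp)
    f_measure = 2.0 * precision * recall / (precision + recall)
    return {
        "Recall": recall,
        "Specificity": specificity,
        "FPR": fpr,
        "FNR": fnr,
        "PWC": pwc,
        "Precision": precision,
        "FMeasure": f_measure,
    }


# Hypothetical counts for a 100-pixel image.
scores = change_detection_scores(tp=8, fp=2, tn=88, fn=2)
```

Because several of these measures pull in opposite directions (for instance Recall against Precision), a method cannot score well on all of them by trading one error type for the other.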

The paper is organized as follows. In Section 2 we define our
understanding of foreground and background, give a brief overview of
the issues arising in background learning and maintenance and explain
how a background model can be used for foreground segmentation. We
describe how PCA can be used to create a model of the scene background
and motivate the use of robust cost functions. In Section 3 we present
the pROST framework, which is motivated and discussed in the context of
background modeling and foreground segmentation. Section 4 provides
details on the implementation on a graphics processing unit. We
evaluate our algorithm on the changedetection.net dataset and discuss
the results in Section 5. Concluding the paper, we analyze typical
issues in the modeling of scene backgrounds with pROST and explain with
a few examples how they are addressed by the choice of parameters.



2 Robust PCA based background models 

Video background is commonly defined as the union of per- 
sistent elements of a scene. They can be static or may exhibit 
repetitive dynamics, which either occur on an object-level, 
e.g. an escalator or a fountain, or on a global scale, e.g. wa- 
ter or waving trees, but also camera jitter. In other words, 
the background is comprised of elements that are known, 
predictable and not of interest for higher-level tasks such 
as surveillance or activity recognition. Everything else that 
moves about in the scene is understood as foreground ob- 
jects. From this definition the idea of treating foreground- 
background segmentation as an outlier detection problem 
arises naturally, i.e. a model of the background is established 
and the foreground is segmented by comparing each video 
frame to this model. Elements of the video frame that do not 
fit the background model are labeled as foreground, while 
the rest is labeled as background. Virtually every algorithm 
published so far follows this approach, but they differ in the
type of background model that is used and how it is maintained,
and in potential pre- and post-processing steps.

Establishing and maintaining an accurate background model 
is not trivial under real-world circumstances, and certain
requirements have to be met by a method to be useful in
challenging scenarios. For example, a scene cannot be observed
in all possible lighting or weather conditions. Or it might 
be impossible to have a separate training stage in which the 
scene is free of foreground objects. Therefore, the ability to 
learn a background model from corrupted training data is of 
crucial importance. Without having any pixel semantics this 
is of course only possible if foreground objects have dif- 
ferent statistical properties in time and space than the back- 
ground. Furthermore, it is also necessary to update the back- 
ground model continuously, which is commonly referred to 
as model maintenance. 



2.1 PCA background models 

PCA background models were introduced in [20] as part of a system for
human activity recognition. The underlying assumption of PCA models is
that a vectorized video background $B \in \mathbb{R}^{m \times n}$ can
be represented as a product of a subspace-defining matrix

$$U \in \mathrm{St}_{k,m} := \{U \in \mathbb{R}^{m \times k} \mid U^\top U = I_k\} \tag{1}$$

and the subspace coordinates $Y \in \mathbb{R}^{k \times n}$.
$\mathrm{St}_{k,m}$ denotes the Stiefel manifold and $I_k$ is the
$k$-dimensional identity matrix. With the supplement of a Gaussian
noise matrix $E \in \mathbb{R}^{m \times n}$ whose $(i,j)$-entries
$\epsilon_{ij} \sim \mathcal{N}(0, \sigma)$ are independent Gaussian
random variables, this results in the data model

$$B = UY + E. \tag{2}$$

Considering now a video sequence $X \in \mathbb{R}^{m \times n}$, which
might also contain additional foreground objects, one can try to
recover the matrix $U$ that defines the subspace and $Y$ as the
solution of the optimization problem

$$\min_{U \in \mathrm{St}_{k,m},\; Y \in \mathbb{R}^{k \times n}} \|X - UY\|_2, \tag{3}$$

where $\|\cdot\|_2$ denotes the $\ell_2$-norm. This is equivalent to
the classic PCA problem [21], whose well-known closed-form minimizer is
given by the leading $k$ eigenvectors of the data covariance matrix.

Given a basis $U$ for the background subspace and an observed image
$x \in \mathbb{R}^m$, a foreground segmentation mask
$F \in \{0, 1\}^m$ can be obtained through fitting the model by firstly
solving

$$y^* = \arg\min_{y} \|x - Uy\|_2 \tag{4}$$

followed by applying a thresholding operation

$$F_i = \begin{cases} 1 & \text{if } |x_i - u_i^\top y^*| > \delta \\ 0 & \text{else,} \end{cases} \tag{5}$$

where $u_i^\top$ denotes the $i$-th row of $U$ and $\delta > 0$ is a
threshold parameter.
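The fit-then-threshold procedure of (4) and (5) can be sketched in a few lines. This is an illustrative example only, using plain Python lists instead of a linear algebra library; it relies on the fact that for a basis with orthonormal columns the least-squares solution of (4) is simply $y^* = U^\top x$. The function names and the toy data are assumptions made for the example.

```python
def fit_coordinates(U, x):
    """Least-squares coordinates y* = U^T x (valid because U^T U = I_k)."""
    m, k = len(U), len(U[0])
    return [sum(U[i][j] * x[i] for i in range(m)) for j in range(k)]


def segment(U, x, delta):
    """Binary foreground mask F: 1 where the residual exceeds delta."""
    y = fit_coordinates(U, x)
    mask = []
    for xi, row in zip(x, U):
        background = sum(u * c for u, c in zip(row, y))  # u_i^T y*
        mask.append(1 if abs(xi - background) > delta else 0)
    return mask


# Toy example: 3 pixels, 2-dimensional background subspace.
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
x = [0.2, -0.1, 0.9]   # the third pixel cannot be explained by U
mask = segment(U, x, delta=0.35)
```

In this toy case the first two pixels are reproduced exactly by the subspace, so only the third pixel is marked as foreground.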

PCA-based methods for background estimation show great advantages over
competing methods whenever backgrounds are dynamic and include
illumination changes [4]. However, as the classical PCA approach uses
the $\ell_2$-norm, the subspace reconstruction may severely suffer from
outliers. In the practical context of background modeling, several
problems can be observed. Firstly, in common PCA undue weight is given
to the foreground elements when fitting the background model to camera
frames during the segmentation process, which severely limits the
admissible size of foreground objects. Secondly, images containing
foreground objects can lead to corruption of the background model
during bootstrapping and background maintenance. Finally, batch
processing results in tremendous memory requirements.

As discussed in [4], a lot of effort has been spent to overcome these
limitations. The predominant mechanism for achieving robustness in
PCA-based methods is weighting or replacing individual pixels in order
to reduce the influence of known foreground objects on the background
reconstruction error. Adaptive thresholding [24] has been proposed to
allow for larger foreground objects and an attempt at a robust
incremental estimation of backgrounds has been presented in
C3. 



2.2 Robust background estimation 

The computer vision community has recognized that Robust PCA methods
offer substantial advantages over classic PCA and background modeling
has become increasingly popular as an application of Robust PCA
algorithms. For an overview of background subtraction using Robust PCA
we refer to [11].

The typical assumption for Robust PCA O is a data 
model 



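$$X = L + S, \tag{6}$$

where $S \in \mathbb{R}^{m \times n}$ is sparse (i.e. having few
non-zero entries) and $L \in \mathbb{R}^{m \times n}$ is low-rank.
Under mild assumptions on $L$ and $S$ it is possible to recover them
via

$$(L^*, S^*) = \arg\min_{\operatorname{rk} L \le k} \|S\|_0 \quad \text{s.t.}\quad X = L + S. \tag{7}$$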



A Robust PCA method is proposed in O that performs 
a convex relaxation of (7) employing an $\ell_1$-penalized outlier
matrix $S$ and minimization of $\|L\|_*$, which denotes the nuclear
norm. Under specific circumstances this convex method is able to
recover the low-rank component exactly. However, only a whole batch of
data samples can be processed and the proposed solvers do not achieve
realtime performance. The authors of GoDec [25] report a significantly
faster processing time, which is achieved by using random projections.
The method is robust against additive Gaussian noise, but it requires
an estimate of the cardinality of the sparse component.

A different way of searching for a low-rank approximation of given
data is to employ the so-called Grassmannian, which is the manifold of
fixed-dimensional subspaces El, ifTTl. In [2] it is shown how the
Grassmannian can be exploited for online subspace tracking, i.e.
analyzing data sample-wise and constantly adapting its low-rank
approximation using a one-step gradient descent. The authors
furthermore demonstrate that subspaces can be reconstructed even from
highly subsampled data if the upper bound on the desired rank is very
low compared to the dimension of the data, which is in the same spirit
as [23]. Finally, the GRASTA method [15] robustifies subspace tracking
using an $\ell_1$-norm cost function and achieves close to realtime
performance on an online background subtraction task.

Approximating the impractical $\ell_0$-norm by the $\ell_1$-norm
offers the advantage of obtaining a convex problem with a guaranteed
globally optimal solution. However, it is known that other measures
such as an $\ell_p$-norm offer a better approximation of the
$\ell_0$-norm, cf. Igl. In [12] it is shown how this kind of
$\ell_0$-surrogate can be incorporated in an alternating minimization
framework for robust subspace estimation and tracking. Numerical
results and online background subtraction experiments indicate that
using a smoothed $\ell_p$-norm sparsifying function increases the
robustness of such methods even further. This paper builds on the
results of both [12] and (TJl, and we present a realtime robust online
subspace tracking method based on alternating minimization of a
smoothed $\ell_p$-norm sparsifying function on manifolds using one-step
gradient and conjugate gradient descent.

3 The pROST algorithm 

As with the classic PCA-based models in Section 2.1, we assume that an
image $x \in \mathbb{R}^m$ is generated by a background subspace model
with the addition of Gaussian noise $\epsilon \in \mathbb{R}^m$ and a
sparse outlier vector $s \in \mathbb{R}^m$ which represents the
foreground in the scene, i.e.

$$x = Uy + s + \epsilon. \tag{8}$$

Our goals are (i) to robustly recover this background subspace $U$
from training data containing foreground objects (i.e. $s \neq 0$),
(ii) to robustly fit the model to unknown video frames in order to
determine the foreground $s$, and (iii) to track any changes to the
background subspace of a scene.

3.1 Weighted smoothed $\ell_p$-norm cost function

In [12] it has been shown that smoothed non-convex sparsity measures
allow the reconstruction of subspaces from corrupted data in cases
where other methods fail. Thus, we construct the cost function based on
the smoothed $\ell_p$-norm as

$$h_\mu : \mathbb{R}^m \to \mathbb{R}^+, \quad x \mapsto \sum_{i=1}^{m} (x_i^2 + \mu)^{\frac{p}{2}}, \quad 0 < p < 1, \tag{9}$$

where $\mu > 0$ is a smoothing parameter. This particular
$\ell_0$-surrogate serves as an arbitrary example; for other possible
functions see [12]. Even though using (9) leads to a non-convex
optimization problem, in practice it is well-behaved and can be
optimized locally by standard methods.

The pROST algorithm is designed with background subtraction for video
streams in mind, and thus we can further tailor the cost function to
this setting. In video data it is sensible to assume that spatial and
temporal proximity of pixels entail identical semantics. In other
words, corresponding pixels in consecutive frames are likely to have
the same label. This knowledge can be used to further increase the
robustness of the residual cost. The idea is to reduce the contribution
of labeled foreground pixels to the overall penalty by introducing
additional pixel weights $w \in \mathbb{R}^m_+$, whose magnitudes
depend on the labels assigned to the pixels in the previous frame. If
the pixel was previously labeled a foreground pixel and is therefore
likely to remain an outlier in the current frame, the weighting should
be small to avoid foreground objects compromising the background. In
the reverse case, if the pixel was labeled a background pixel before,
the weight should be equal to one to allow for model maintenance. In
this way the algorithm avoids erroneously fitting the background model
to already known foreground objects and it can focus on fitting the
background model to the scene background instead. This extension to the
cost function not only eases bootstrapping from corrupted training
data, but also overcomes the reported difficulties of PCA methods with
large foreground objects [4].

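We incorporate pixel-weighting by defining the weighted smoothed
$\ell_p$-norm cost function

$$h_{w,\mu} : \mathbb{R}^m \to \mathbb{R}^+, \quad x \mapsto \sum_{i=1}^{m} w_i\, (x_i^2 + \mu)^{\frac{p}{2}}, \quad 0 < p < 1, \tag{10}$$

and the eventual cost function to be minimized in pROST is

$$h_{w,\mu}(x - Uy). \tag{11}$$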
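As an illustration of how the weighted smoothed $\ell_p$-norm treats outliers, the sketch below evaluates (10) for a residual vector and compares it with a plain sum of squares. The function name and all numeric values are assumptions chosen for the example, not parameters taken from the experiments.

```python
def weighted_smoothed_lp(r, w, p, mu):
    """Evaluate sum_i w_i * (r_i^2 + mu)^(p/2) for a residual vector r."""
    return sum(wi * (ri * ri + mu) ** (p / 2.0) for ri, wi in zip(r, w))


# Residual with two small entries and one large outlier (foreground pixel).
residual = [0.1, -0.1, 5.0]
uniform = [1.0, 1.0, 1.0]
downweighted = [1.0, 1.0, 0.005]   # hypothetical small foreground weight

cost_lp = weighted_smoothed_lp(residual, uniform, p=0.25, mu=1e-3)
cost_lp_weighted = weighted_smoothed_lp(residual, downweighted, p=0.25, mu=1e-3)
cost_l2 = sum(ri * ri for ri in residual)

# Share of the total cost that is caused by the outlier alone.
outlier_share_lp = (5.0 ** 2 + 1e-3) ** 0.125 / cost_lp
outlier_share_l2 = 5.0 ** 2 / cost_l2
```

Under the squared $\ell_2$ cost the single outlier accounts for almost the entire penalty, whereas under the smoothed $\ell_p$ cost its share is far smaller, and the pixel weight reduces it further.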

3.2 Optimization on the Grassmannian 

The topics of optimization on the Stiefel manifold and the
Grassmannian are covered in great detail in [1] and [7]. Here we only
recall the most important results and apply them to our specific
problem.

In Section 2.1 we define $U$ to be an element of the so-called Stiefel
manifold $\mathrm{St}_{k,m}$. However, optimizing over the entire set
$\mathrm{St}_{k,m}$ is not necessary, because whenever $(U, y)$ is a
reasonable solution then so is $(UQ, Q^\top y)$ for

$$Q \in O(k) := \{Q \in \mathbb{R}^{k \times k} \mid Q^\top Q = I_k\}, \tag{12}$$

where $O(k)$ denotes the set of $k \times k$-dimensional orthogonal
matrices. In other words, we are only interested in the subspace
spanned by the columns of $U$, and not in a particular basis of that
subspace, so the search space can be reduced. To that end we employ the
well-known Grassmannian, which is defined as the quotient manifold

$$\mathrm{Gr}_{k,m} := \mathrm{St}_{k,m} / O(k), \tag{13}$$

with the equivalence relation $U \sim \tilde{U}$ if and only if there
exists a $Q \in O(k)$ such that $\tilde{U} = UQ$. We denote the
equivalence class for some representative $U \in \mathrm{St}_{k,m}$ by

$$[U] = \{\tilde{U} \in \mathrm{St}_{k,m} \mid \tilde{U} \sim U\} \in \mathrm{Gr}_{k,m}. \tag{14}$$

Note that the class $[U]$ does not have a matrix representation in
$\mathbb{R}^{m \times k}$. So whenever we store $[U]$ we will do that
by using one (arbitrary) class representative. In contrast, the Stiefel
manifold has a unique matrix representation, as does its tangent space,
which is given by

$$T_U \mathrm{St}_{k,m} := \{B \in \mathbb{R}^{m \times k} \mid B^\top U + U^\top B = 0\}. \tag{15}$$

Due to the quotient geometry of $\mathrm{Gr}_{k,m}$, the spaces
$T_U \mathrm{St}_{k,m}$ and $T_{UQ} \mathrm{St}_{k,m}$ share one
subspace independently of $Q$, which we identify with the tangent space
of the Grassmannian, namely

$$T_{[U]} \mathrm{Gr}_{k,m} := \{B \in \mathbb{R}^{m \times k} \mid U^\top B = 0\}. \tag{16}$$



Optimization on Riemannian manifolds like the Grassmannian is done by
moving along tangent space directions on geodesics of the manifold. In
order to do this, it is necessary to first project the ambient space
gradient

$$\mathrm{grad}_U = \nabla_U h_{w,\mu}(x - Uy) \tag{17}$$

onto the tangent space at $U$ and then, in the case of minimization,
move along the geodesic on the manifold in the opposite direction.

Geodesics are curves that locally minimize the distance between two
points on the manifold. For a given tangent direction
$B \in T_{[U]} \mathrm{Gr}_{k,m}$, the geodesic on the Grassmannian
emanating from $[U]$ in direction $B$ is given by $[U(t)]$ with

$$U(t) = \left(U V \cos(\Sigma t) + \Theta \sin(\Sigma t)\right) V^\top, \tag{18}$$

where $\Theta \Sigma V^\top = B$ is the compact Singular Value
Decomposition of $B$, cf. Q.

It is easy to verify that the orthogonal projection of some
$H \in \mathbb{R}^{m \times k}$ onto $T_{[U]} \mathrm{Gr}_{k,m}$ is
given by

$$\pi(H) = (I - UU^\top) H. \tag{19}$$

Using all these components we can formulate a procedure in order to
find the background subspace model $U$. We propose an alternating
approach that iteratively updates $U$ and $y$. The cost function that
is minimized is invariant on equivalence classes $[U]$ when considered
with two variables $(U, y)$. However, as this is no longer the case
when $y$ is fixed, it is not reasonable to search for an optimal
element on $\mathrm{Gr}_{k,m}$. Thus, the optimization step for fixed
$y$ will be taken in the direction $\pi(\mathrm{grad}_U)$ along $U(t)$
as defined in (18).
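For reference, the claim that the map $H \mapsto (I - UU^\top)H$ lands in the tangent space of the Grassmannian and acts as a projection follows directly from $U^\top U = I_k$. The short check below is a worked verification added for illustration; it uses only the definitions given above.

```latex
\begin{align*}
U^\top \pi(H) &= U^\top (I - UU^\top) H
               = (U^\top - \underbrace{U^\top U}_{=\,I_k} U^\top) H = 0, \\
\pi(\pi(H))   &= (I - UU^\top)^2 H
               = (I - 2UU^\top + U \underbrace{U^\top U}_{=\,I_k} U^\top) H
               = (I - UU^\top) H = \pi(H).
\end{align*}
```

The first line shows that $\pi(H)$ satisfies the defining condition $U^\top B = 0$ of the tangent space, and the second that applying $\pi$ twice changes nothing. Since $I - UU^\top$ is also symmetric, the projection is orthogonal.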



3.3 An online alternating minimization algorithm 

In an online setting the video frames arrive at a certain rate and
have to be processed as they arrive. Processing a frame $x^{(i+1)}$ at
time instance $i + 1$ involves three steps, which are robustly fitting
the background model to the frame, updating the background subspace
model $U$ to cope with changes in the background, and segmenting
foreground and background for the current frame:

Step 1: Refine $y^{(i)}$ to obtain $y^{(i+1)}$ via

$$y^{(i+1)} = \arg\min_{y} h_{w,\mu}(x^{(i+1)} - U^{(i)} y). \tag{20}$$

Step 2: Take one gradient descent step along $U(t)$ as defined in (18)
to approximate

$$t^* = \arg\min_{t \in \mathbb{R}} h_{w,\mu}(x^{(i+1)} - U(t)\, y^{(i+1)}) \tag{21}$$

and to obtain the updated subspace $U^{(i+1)} := U(t^*)$.

Step 3: Identify the outliers or the foreground pixels to obtain the
reconstruction cost weighting $w$ for the next iteration

$$w_i = \begin{cases} 1 & \text{if } |x_i - u_i^\top y| < \delta \\ \omega & \text{else,} \end{cases} \tag{22}$$

where $\omega$ is the weight for the foreground pixel reconstruction
error. In order to be able to slowly incorporate foreground objects
into the background, $\omega$ should be set to a small, but non-zero
value.
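The weight update of Step 3 is simple enough to sketch directly. The example below is illustrative only: the function name is hypothetical, the data are toy values, and the numbers passed for the threshold and the foreground weight are placeholders rather than the settings used in the benchmark.

```python
def update_weights(x, U, y, delta, omega):
    """Per-pixel weights for the next frame, following the rule in Step 3.

    Pixels whose residual stays below delta are treated as background
    (weight 1); all others are treated as foreground (weight omega).
    """
    weights = []
    for xi, row in zip(x, U):
        residual = xi - sum(u * c for u, c in zip(row, y))  # x_i - u_i^T y
        weights.append(1.0 if abs(residual) < delta else omega)
    return weights


# Toy example: 3 pixels, 2-dimensional subspace, third pixel is an outlier.
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
x = [0.2, -0.1, 0.9]
y = [0.2, -0.1]
w = update_weights(x, U, y, delta=0.35, omega=0.005)
```

Keeping the foreground weight strictly positive means that a pixel which stays an outlier for a long time still exerts a small pull on the subspace, so stationary objects can eventually be absorbed into the background.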

In Step 1 pROST uses a Conjugate Gradient (CG) algorithm [16] to
perform the optimization. Even though one iteration of CG is more
expensive than one iteration of a simpler gradient descent algorithm,
it needs fewer iterations to identify the outliers. Since most of the
computational cost per iteration is due to the evaluation of the cost
function and computation of the gradient, this actually leads to a more
efficient algorithm. Our experiments have shown that in most cases as
few as five CG iterations are sufficient.

In Step 2 pROST takes one gradient descent step on the Grassmannian.
This would usually require the costly computation of the full SVD of
the projected gradient. In the online setting, however, this can be
avoided. The derivative of the cost function with respect to $U_{ij}$
is given by

$$\frac{\partial h_{w,\mu}}{\partial U_{ij}} = -p\, w_i\, r_i\, (r_i^2 + \mu)^{\frac{p}{2} - 1}\, y_j \tag{23}$$

with $r = x - Uy$. Using the short notation

$$\eta_i = -p\, w_i\, r_i\, (r_i^2 + \mu)^{\frac{p}{2} - 1}, \quad i = 1, \dots, m, \tag{24}$$

the projected gradient can be expressed as

$$G_U = \pi(\eta)\, y^\top. \tag{25}$$

It can easily be verified that $G_U$ has rank one and its SVD is given
by

$$G_U = \sigma\, s\, v^\top \tag{26}$$

with

$$s = \frac{\pi(\eta)}{\|\pi(\eta)\|_2}, \quad \sigma = \|\pi(\eta)\|_2\, \|y\|_2, \quad v = \frac{y}{\|y\|_2}. \tag{27}$$

Consequently, we are freed of computing the SVD of the search direction
at each iteration. This approach has also been taken in GROUSE [2] and
GRASTA [15] in order to obtain a fast online gradient descent algorithm
for subspace estimation.
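A minimal sketch of such an SVD-free subspace step is given below, written with plain Python lists so that it is self-contained. It forms the vector of (24), projects it as in (19), and moves along the geodesic (18) in the descent direction. The explicit update formula in the last lines is derived here under the assumption that the remaining singular values of the rank-one direction are zero, so that only the direction $v$ is rotated; this matches the general shape of rank-one Grassmannian updates but is not copied from the authors' implementation. All function names and numeric values are hypothetical.

```python
import math


def cost(x, U, y, w, p, mu):
    """Weighted smoothed lp cost of the residual x - U y."""
    total = 0.0
    for xi, row, wi in zip(x, U, w):
        r = xi - sum(u * c for u, c in zip(row, y))
        total += wi * (r * r + mu) ** (p / 2.0)
    return total


def rank_one_step(x, U, y, w, p, mu, t):
    """One descent step of length t along the geodesic through U."""
    m, k = len(U), len(U[0])
    r = [x[i] - sum(U[i][j] * y[j] for j in range(k)) for i in range(m)]
    # eta_i = -p w_i r_i (r_i^2 + mu)^(p/2 - 1)
    eta = [-p * w[i] * r[i] * (r[i] ** 2 + mu) ** (p / 2.0 - 1.0)
           for i in range(m)]
    # Projection pi(eta) = eta - U (U^T eta)
    ut_eta = [sum(U[i][j] * eta[i] for i in range(m)) for j in range(k)]
    proj = [eta[i] - sum(U[i][j] * ut_eta[j] for j in range(k))
            for i in range(m)]
    norm_proj = math.sqrt(sum(e * e for e in proj))
    norm_y = math.sqrt(sum(c * c for c in y))
    if norm_proj == 0.0 or norm_y == 0.0:
        return [row[:] for row in U]  # zero gradient: nothing to update
    sigma = norm_proj * norm_y
    s = [e / norm_proj for e in proj]
    v = [c / norm_y for c in y]
    uv = [sum(U[i][j] * v[j] for j in range(k)) for i in range(m)]  # U v
    cos_t, sin_t = math.cos(sigma * t), math.sin(sigma * t)
    # Geodesic for the descent direction -G_U (assumed rank-one form).
    return [[U[i][j] + (uv[i] * (cos_t - 1.0) - s[i] * sin_t) * v[j]
             for j in range(k)] for i in range(m)]


# Toy example: one-dimensional subspace in R^3.
U0 = [[1.0], [0.0], [0.0]]
x = [1.0, 0.5, 0.0]
y = [1.0]
w = [1.0, 1.0, 1.0]
U1 = rank_one_step(x, U0, y, w, p=0.5, mu=0.01, t=0.1)
cost_before = cost(x, U0, y, w, 0.5, 0.01)
cost_after = cost(x, U1, y, w, 0.5, 0.01)
```

In this toy case the basis vector rotates slightly towards the observed frame, its norm stays equal to one, and the cost decreases. The work per step consists of a few matrix-vector products, which is what makes the update attractive for an online setting.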

3.4 Practical issues

Initialization In this alternating scheme an initialization 
for U has to be provided. We choose to initialize the sub- 
space randomly, which can be performed by computing a 
reduced QR-decomposition of a random $m \times k$ matrix. This
means that pROST does not use a separate batch initializa- 
tion phase. It is fully capable of recovering subspaces from 
video data corrupted with foreground objects. The background 
subspace is learned one frame at a time while continually reducing
the step size from $t_{\mathrm{init}}$ to $t_{\min}$. The former should be
chosen quite high itinu e [10""^, 1]) to facilitate quick ini- 
tialization, while the latter should be chosen quite low to 
avoid trailing ghost images of moving foreground objects 
€ [10-^ 10-4]). 
The step size for the subspace updates in each iteration is
defined by the step-size rule



t = max{^ tjy 



(28) 



where $i$ is the iteration and $r$ is a parameter controlling the
shrinkage rate for the step size reduction. Whenever an
initialization phase is defined by an exact number of frames
$i_{\mathrm{init}}$, the parameter $r$ can be calculated as



(29) 



Pre- and post-processing Firstly, the running average of 
the image data is maintained during the initialization phase 
and subtracted from each frame before pROST is applied. 
This means the background subspace has to capture only 
the dynamic aspects of the scene. Secondly, the images are 
normalized by dividing the intensity values by the sample 
standard deviation over all pixels in the initialization phase. 
In our experiments we observe that this kind of preprocess- 
ing is highly beneficial for capturing the scene dynamics. To 
achieve fast and uniform processing we re-sample all videos 
to a size of 160 x 120. 

Apart from the thresholding operation, we also apply a 
3x3 median filter to the foreground segmentation mask F to 
fill small holes and to get rid of small clusters of erroneously 
labeled pixels. 
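The effect of the median filter on a binary mask can be illustrated with a small sketch. This is a plain-Python illustration, not the implementation used in the experiments; in particular, the border handling (using only the neighbors that exist) is an assumption, and the function name and toy masks are hypothetical.

```python
def median_filter_3x3(mask):
    """Apply a 3x3 median filter to a binary mask given as a list of rows.

    At the image border the window is truncated to the pixels that exist
    (an assumption made for this sketch).
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [mask[a][b]
                      for a in range(max(0, i - 1), min(rows, i + 2))
                      for b in range(max(0, j - 1), min(cols, j + 2))]
            window.sort()
            out[i][j] = window[len(window) // 2]
    return out


# An isolated false positive in an otherwise empty mask.
speckle = [[0] * 5 for _ in range(5)]
speckle[2][2] = 1

# A one-pixel hole inside a foreground region.
hole = [[1] * 5 for _ in range(5)]
hole[2][2] = 0

cleaned_speckle = median_filter_3x3(speckle)
filled_hole = median_filter_3x3(hole)
```

For a binary mask the median amounts to a majority vote within each window, which is why single mislabeled pixels disappear while larger connected regions are preserved.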

Color images If colored video is available, it is clearly advantageous
to use the information provided in the color channels for segmentation.
We represent a colored vectorized video frame of size $m$ by a vector

$$x = \begin{pmatrix} x_R \\ x_G \\ x_B \end{pmatrix} \in \mathbb{R}^{3m}, \tag{30}$$

where the $i$-th entry of $x_R, x_G, x_B \in \mathbb{R}^m$ is given by
the respective channel value at pixel $i$. Accordingly, the background
subspace is modeled by

$$U = \begin{pmatrix} U_R \\ U_G \\ U_B \end{pmatrix} \in \mathrm{St}_{k,3m}. \tag{31}$$

The pixel $i$ is classified as foreground if the difference between
the observed value and the reconstructed background is large enough in
any of the channels, i.e. if

$$\max\left(|x_{R,i} - u_{R,i}^\top y|,\; |x_{G,i} - u_{G,i}^\top y|,\; |x_{B,i} - u_{B,i}^\top y|\right) > \delta. \tag{32}$$

Here, $u_{R,i}^\top, u_{G,i}^\top, u_{B,i}^\top$ denote the respective
rows of $U$.



4 Implementation 

In order to achieve realtime performance we have imple- 
mented pROST on a GPU. More precisely, the preprocess- 
ing and all steps of pROST are implemented on the GPU, 
whereas the median filtering operation in the post-processing 
stage runs on the CPU. For transferring the images to the 
GPU we use pinned host memory. 

One of the strengths of pROST is its simplicity. Since most of the
operations involved are matrix operations, pROST can be parallelized
very efficiently on a GPU using C++, CUDA and the highly optimized
cuBLAS library for linear algebra operations on the GPU. Step 2 of
pROST, for example, can be implemented with as few as four General
Matrix Multiply (GEMM) operations and only takes about 5 ms for a
subspace dimension of $k = 15$ and an image resolution of 160 x 120 on
an Nvidia GTX 660 GPU.

For images in this resolution and a subspace dimension of 15, more
than 50% of the computation time is spent on matrix multiplications and
a further 30% on basic operations like matrix addition, element-wise
multiplication and evaluating the cost function. In order to optimize
the implementation even further, we have taken great care to reduce the
required number of matrix multiplications and to order them in such a
way as to reduce the overall complexity and memory requirements.
Evaluating the cost function involves a parallel reduction, which has
been implemented following the scheme presented in [13].



5 Evaluation 

Our goal for the evaluation is twofold. Firstly, we rank the pROST
method for background subtraction among competing methods on a
widely-known benchmark. Secondly, we show that using the weighted
smoothed $\ell_p$-norm instead of the $\ell_1$-norm leads to superior
results for background subtraction by comparing our method to GRASTA
[15], a state-of-the-art representative of online Robust PCA. As
mentioned in Section 1, we conduct all experiments on the
changedetection.net dataset, and apart from discussing the results here
we will also publish the results on the project website to allow a
quick and easy comparison with different approaches like GMM-based or
non-parametric methods. As the benchmark requires a static
configuration for all scenarios we fix all parameters for a first
overview and discuss their particular influence afterwards in a more
detailed investigation.

Fig. 1 Performance of the GPU implementation of pROST for several
image scalings and subspace dimensions. Scaling is relative to 320 x
240 images.

5.1 Performance on the changedetection.net benchmark 

The changedetection.net dataset [10] consists of six categories
of videos and provides ground truth for each frame.
The ground truth contains information about background 
and foreground objects as well as their boundaries and shad- 
ows. For some of the videos, the segmentation is evaluated 
only for certain regions of interest (ROI) while for others 
the whole image is evaluated. In order to produce compara- 
ble results, an evaluation tool is provided which computes 
significant statistical measures for the segmented images. 
The evaluation starts after a certain number of frames, which 
can be used for initialization. However, these training sam- 
ples have the same foreground-background distribution as 
the ones used for evaluation and can therefore contain fore- 
ground objects. 

For the benchmark evaluation we select the following 
parameters, which maximize the overall performance 

- $k = 15$ (subspace dimension),

- a; = 5 X 10"^ (foreground weighting), 

- tinit = 5 X 10"^ (initial stepsize), 

- tmin = 10""^ (online stepsize), 

- $\delta = 0.35$ (threshold),

- p = 0.25, /d = S^(l-p) (smoothed ^^-norm parameters). 

For each frame we perform a maximum of five CG steps for
the optimization of (20).

The detailed results for pROST are listed in Table 1. By varying the
threshold parameter $\delta \in [0.05, 0.6]$ we obtain the ROC curves
for all categories, which are displayed in Figure 2.

In order to compare pROST to GRASTA [15] we rely on the streaming
version of GRASTA, whose MATLAB implementation is available for
download on the author's website.^1 This implementation is intended to
work with grayscale images, whereas we work with RGB color images. To
allow for a fair comparison we have modified GRASTA to work with such
images. We use the same subspace dimension as for pROST and down-sample
all images to a resolution of 160 x 120. GRASTA requires an
initialization phase in which an initial background model is learned
from a batch of training images. To allow the best possible outcome in
this phase we use the largest possible set of training images, i.e. all
frames at the beginning of the videos that are not evaluated, and use
all the pixels in each video frame to learn the subspace. We allow
GRASTA to take three passes over the data, which means that it
encounters each video frame three times as often as pROST during the
initialization process. We rely on the default parameters of the MATLAB
implementation except for the detection threshold and the percentage of
pixels used for updating the subspace during the tracking stage. The
demo implementation suggests using 10% of the pixels, while we use 25%.
The reason for not using all available pixels is that GRASTA is
explicitly designed for reconstructing subspaces from incomplete
information. In all experiments we have observed that GRASTA's
performance indeed does not deteriorate markedly if the data is
subsampled as the authors describe. We use a value of 0.2 for
thresholding in GRASTA, which is twice as high as the threshold
suggested by the authors. This choice is motivated by the fact that the
backgrounds in the changedetection.net benchmark are highly dynamic and
a lower threshold would lead to excessive amounts of false positives.
The obtained segmentation masks are post-processed by applying a 3 x 3
median filter. All benchmark results for GRASTA are listed in Table 2.

5.2 Discussion 

The pROST method excels when the camera is jittering and 
ranks first in this category by a large margin. In the other cat- 
egories the method ranks mid-field. It is important to note, 
however, that it is possible to achieve better performance in 
the other categories by tuning the parameters individually 
(see Section 5.3).

^ https://sites.google.com/site/hejunzz/grasta
Table 1 Per category results for pROST in the changedetection.net benchmark

Category                    Recall  Specificity  FPR     FNR    PWC   Precision  F-Measure
baseline                    0.801   0.9941       0.0059  0.199  1.28  0.805      0.799
camera jitter               0.770   0.9925       0.0075  0.230  1.56  0.825      0.792
dynamic background          0.743   0.9945       0.0055  0.257  0.73  0.566      0.595
intermittent object motion  0.540   0.9137       0.0863  0.460  9.71  0.488      0.419
shadow                      0.754   0.9798       0.0202  0.256  2.85  0.671      0.706
thermal                     0.497   0.9920       0.0080  0.503  2.97  0.756      0.584
overall                     0.684   0.9778       0.0222  0.316  3.18  0.685      0.650



Table 2 Per category results for GRASTA in the changedetection.net benchmark

Category                    Recall  Specificity  FPR     FNR    PWC   Precision  F-Measure
baseline                    0.609   0.9926       0.0074  0.391  2.13  0.740      0.664
camera jitter               0.622   0.9282       0.0718  0.378  8.36  0.354      0.434
dynamic background          0.701   0.9760       0.0240  0.299  2.61  0.262      0.355
intermittent object motion  0.311   0.9842       0.0158  0.689  6.32  0.515      0.359
shadow                      0.608   0.9554       0.0446  0.392  6.09  0.536      0.529
thermal                     0.344   0.9851       0.0149  0.656  6.13  0.726      0.428
overall                     0.533   0.9702       0.0298  0.467  5.27  0.522      0.461
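
For reference, the column headings of the two tables can be related to pixel counts as in the sketch below. It uses the standard definitions of these measures in terms of true/false positives and negatives, which we assume to match the benchmark's evaluation; the function name and the counts in the usage line are hypothetical.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute the seven table measures from pixel counts.

    tp/fp: foreground detections that are correct/incorrect,
    fn/tn: background decisions that are incorrect/correct.
    """
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)                               # false positive rate
    fnr = fn / (tp + fn)                               # false negative rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)      # percentage of wrong classifications
    precision = tp / (tp + fp)
    f_measure = 2.0 * precision * recall / (precision + recall)
    return {
        "Recall": recall,
        "Specificity": specificity,
        "FPR": fpr,
        "FNR": fnr,
        "PWC": pwc,
        "Precision": precision,
        "F-Measure": f_measure,
    }

# Hypothetical counts for a single video, purely for illustration.
example = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)
```

Under these definitions Recall and FNR sum to one, as do Specificity and FPR.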



Fig. 2 ROC curves for pROST on the changedetection.net benchmark (bsl: baseline, cji: camera jitter, dyb: dynamicBackground, iom: intermittentObjectMotion, sha: shadow, the: thermal, all: overall)
A clear strength of pROST is its handling of fast variations
in the background, such as camera jitter, and of scenes with
quickly moving foreground objects. The outstanding
performance achieved in the camera jitter category, which mostly
requires initialization from video heavily corrupted with
outliers, shows that the method can bootstrap in very
difficult situations. pROST is also capable of dealing with
gradual lighting changes. The comparison to GRASTA shows
that pROST's performance in the camera jitter category is
not a general feature of PCA models, but rather a combined
result of the cost function, the optimization methods and the
introduced foreground pixel weighting.

Situations in which the algorithm fails include the
relocation of background objects, for which the performance
in the intermittent object motion category is a clear
indicator. These problems can be alleviated to some degree by
adjusting the parameters that control the speed of background
adaptation. The underlying problem, however, is that the
algorithm must adapt to some changes faster than to others.
When an object that was formerly part of the background
starts to move, the newly revealed background, which will
now be labeled as foreground, has to be integrated into the
model as quickly as possible. At the same time, the moving
object has to remain in the foreground, even when it stops
moving and becomes stationary. When a foreground object
becomes stationary it has the same spatio-temporal
properties as the newly revealed background, and consequently our
algorithm will treat them equally. The demands are
therefore conflicting. The background subtraction
algorithm presented here is not specifically designed for this
task, but through the individual weighting coefficients
included in the cost function, pROST could solve this
problem in principle. Camouflage remains an unresolved
problem. Apart from the median filtering in the
post-processing step, the algorithm has no means of
exploiting spatial correlation of pixel labels within a frame. It has
to rely solely on color or intensity differences to make
segmentation decisions at pixel level, which makes it ill-suited
to cope with this phenomenon.

5.3 Choice of parameters 

In the following we explain the most important parameters,
their influence on the performance and behavior of pROST
and how we set them for the evaluation. The effects of
different parameter settings are illustrated by results obtained in
the changedetection.net benchmark.

Fig. 5 Effects of foreground weighting on dynamic backgrounds with
extremely large foreground objects (k = 15, p = 0.25, μ = 10~^).
Panels: (a) frame 467 of the bungalows video and ground truth;
(b) no foreground weighting; (c) with foreground weighting ω = 10 ^

Subspace dimension The subspace dimension k defined by
the Stiefel dimension ([T]) can have considerable impact on
the computational complexity of the algorithm, but also on
other performance measures. Choosing an overly large value
for k increases the computational cost without improving
performance. Low-complexity backgrounds like
those in the baseline category can be represented with a
dimension as low as k = 2, while the representation of highly
dynamic backgrounds like those in the camera jitter category can
benefit from a higher limit on the subspace dimension.
Figure 3 shows the results for three different choices of k,
with the resulting background and segmentation for frame #1381
of the traffic video. To give a more detailed impression
of how the background is represented by pROST, we
provide some further insight in Figure 4. Notice that the
background contains only the dynamic aspects of the scene, as the
running average is subtracted from each video frame. With
growing subspace dimension finer details of the background
can be captured, but the model also becomes more flexible
in areas that do not require this flexibility. This causes
foreground objects to be integrated into the background. We
furthermore observed a strong relationship between
image size and required subspace dimension. For
images of size 160 × 120 a subspace dimension of 10 to 15 is
sufficient, even for the highly dynamic camera jitter
backgrounds. For images of size 320 × 240 the camera jitter
category requires much higher-dimensional subspaces and also
a longer initialization. This problem can be mitigated by
reducing the information content in the image, for example by
band-limiting the image with a Gaussian blur filter. Another
approach is downsampling the image before processing
and upsampling the segmentation mask for the evaluation.
The latter is clearly preferable, because it also reduces
computational complexity.
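
The downsampling approach mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions: frames are arrays with a trailing channel axis, downsampling is done by block averaging with an integer factor, and the mask is upsampled by nearest-neighbour repetition. The function names are hypothetical, and the resampling method actually used in the experiments may differ.

```python
import numpy as np

def downsample(frame, factor=2):
    """Block-average an H x W x C frame by an integer factor."""
    h, w = frame.shape[:2]
    # Crop so that both dimensions are divisible by the factor.
    h, w = h - h % factor, w - w % factor
    cropped = frame[:h, :w].astype(np.float64)
    blocks = cropped.reshape(h // factor, factor, w // factor, factor, -1)
    # Averaging each block also acts as a crude low-pass filter.
    return blocks.mean(axis=(1, 3))

def upsample_mask(mask, factor=2):
    """Nearest-neighbour upsampling of a binary mask to the original grid."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)
```

With factor=2, a 320 × 240 frame is reduced to 160 × 120 before processing, and the resulting segmentation mask is expanded back to 320 × 240 for the evaluation.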

Foreground weighting The foreground weighting
parameter ω from (22) has a large effect on the algorithm's
bootstrapping capability, on how it deals with highly dynamic,
complex backgrounds and on its robustness to large foreground
objects. In some scenarios, such as the bungalows video in the
shadow category, foreground weighting is a crucial
component for recovering a background model that is not corrupted
by foreground objects. Figure 5 showcases this effect. We
found that foreground weighting allows for larger step sizes
t_min without producing ghost images.

Step size The interplay between foreground weighting and
different choices for the step size is displayed in Figure 6.
Without any weighting the dynamic elements can be
compensated using a large step size, but large foreground objects
are incorporated too quickly into the background, which leads
to reconstruction failure in some cases. A very small step
size prevents the formation of ghost images, but makes it
impossible to adapt to the dynamic elements and lighting
changes. Finally, combining a large step size and foreground
weighting solves both problems.

Fig. 3 Effects of subspace dimensionality in jittery videos (p = 1.0, μ = 10~^, ω = 1.0).
Panels: (a) frame 1381 of the traffic video and ground truth; (b) k = 5; (c) k = 15; (d) k = 40

Fig. 7 Effects of ℓp-norm parameter p on background model learning
and fitting in jittery videos (k = 15, μ = 10~^, ω = 1.0).
Panels: (a) frame 1291 of the traffic video and ground truth; (b) p = 2.0; (c) p = 1.0; (d) p = 0.5

Fig. 8 Effects of ℓp-norm parameter μ on background model learning
and fitting in jittery videos (k = 15, p = 0.5, ω = 1.0).
Panels: (a) frame 1243 of the traffic video and ground truth; (b) μ = 10^-1; (c) μ = 10-; (d) μ = 10^-5

Cost function parameters In contrast to the ℓ1-norm, the
choice of p and μ in the smoothed ℓp-norm offers
control over the degree of robustness to outliers in the data. To
analyze this effect, we extend the definition of the smoothed
ℓp-norm cost function to the case p > 1. Figure 7 illustrates
that lowering the parameter p can reduce the rate of
incorporating foreground objects into the background.
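
The influence of p on outlier robustness can be illustrated numerically. The sketch below assumes that the smoothed ℓp-norm has the per-entry form (x² + μ)^(p/2), which is not restated in this section, and compares the derivative of the cost at a large residual with that at a small residual. The residual values, μ = 10^-3 and all function names are hypothetical choices for illustration.

```python
import numpy as np

def smoothed_lp(x, p, mu):
    """Assumed smoothed l_p cost: sum over entries of (x^2 + mu)^(p/2)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x ** 2 + mu) ** (p / 2)))

def smoothed_lp_grad(x, p, mu):
    """Entry-wise derivative of the assumed cost with respect to x."""
    x = np.asarray(x, dtype=float)
    return p * x * (x ** 2 + mu) ** (p / 2 - 1)

def outlier_pull_ratio(p, mu=1e-3, small=0.05, large=0.5):
    """Ratio of the derivative at a large residual to that at a small one."""
    g_small = smoothed_lp_grad(small, p, mu)
    g_large = smoothed_lp_grad(large, p, mu)
    return float(g_large / g_small)

# The ratio shrinks as p is lowered from 2.0 to 0.5.
ratios = [outlier_pull_ratio(p) for p in (2.0, 1.0, 0.5)]
```

For p = 2 the derivative grows linearly with the residual, so a large outlier dominates the update. For p < 1 the ratio drops below one under these assumptions, meaning that large residuals such as foreground pixels pull on the background model less than small ones.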

Fig. 4 Elements of the evolving 15-dimensional subspace for the traffic dataset.
Panels: (a) frame 180; (b) frame 500; (c) frame 1000

Fig. 6 Effects of foreground weighting on bootstrapping with extremely large foreground objects (k = 15, p = 0.5, μ = 10~^).
Panels: (a) frame 2622 of the fall video and ground truth; (b) frame 2622, no weighting and small step size (t_min = 10 ^);
(c) frame 2049, no weighting and large step size (t_min = 10 ^); (d) frame 2622, weighting and large step size (t_min = 10 ^);
(g) frame 2643, no weighting and large step size (t_min = 10 ^); (h) frame 2643, weighting and large step size (t_min = 10 ^)

As discussed in Section 2, Robust PCA algorithms aim
to accurately recover the assumed low-rank component
L of a data matrix X. However, when a decision is based on
thresholding S = X − L, a certain degree of reconstruction
accuracy is sufficient: further decreasing the reconstruction
error after it has fallen below the threshold does not
add to the performance, but unnecessarily requires
computational resources. In this sense, an estimate of L should be
considered sufficiently close to the true L if it does not
produce a false positive. To reflect this idea in the cost function,
we require that, starting at the threshold δ, the partial
derivative of the cost function with respect to the reconstruction
error becomes smaller as we approach zero and becomes
larger as we approach the threshold from above. This
translates to the constraint



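$$\left.\frac{\partial^2}{\partial x^2}\,(x^2+\mu)^{p/2}\right|_{x=\delta} = 0, \quad \text{for fixed } p < 1. \tag{33}$$

A quick calculation reveals that

$$\mu = (1-p)\,\delta^2 \tag{34}$$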



meets this constraint. To verify that this coupling between
μ and δ indeed leads to near-optimal smoothing, we conduct
an experiment in which we evaluate a number of
smoothing parameters for two different thresholds on the traffic and
badminton videos. The results in Figure 5.3 confirm our
theoretical analysis. Furthermore, the performance degrades
markedly if a very small smoothing parameter is chosen.
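
The coupling between the smoothing parameter and the threshold can be checked numerically. The sketch below rests on an assumption not restated in this section, namely that the per-entry cost is (x² + μ)^(p/2). Its second derivative is p (x² + μ)^(p/2 − 2) (μ − (1 − p) x²), which vanishes at x = δ when μ = (1 − p) δ², so the first derivative should peak exactly at the threshold. The values of δ and p and the function names are hypothetical.

```python
import numpy as np

def coupled_mu(delta, p):
    """Smoothing parameter implied by the threshold under the assumed cost."""
    return (1.0 - p) * delta ** 2

def cost_derivative(x, p, mu):
    """First derivative of the assumed per-entry cost (x^2 + mu)^(p/2)."""
    return p * x * (x ** 2 + mu) ** (p / 2 - 1)

def derivative_peak(delta, p, num=100001):
    """Locate the residual at which the derivative is largest on a fine grid."""
    mu = coupled_mu(delta, p)
    xs = np.linspace(1e-4, 1.0, num)
    return float(xs[np.argmax(cost_derivative(xs, p, mu))])

# Hypothetical threshold and exponent, chosen only for illustration.
x_peak = derivative_peak(delta=0.2, p=0.5)
```

Under these assumptions the located peak coincides with δ up to the grid resolution: the derivative decreases towards zero below the threshold and decreases again above it, which is the behavior the constraint asks for.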

6 Conclusion 

We have presented a novel subspace tracking algorithm which
combines concepts of previous methods and introduces a
weighted robust cost function tailored to the task of
background modelling and foreground segmentation from
video data. The method is implemented on a GPU, achieves
frame rates between 30 and 45 FPS on images at a
resolution of 160 × 120 and is thus real-time capable. One of the
noteworthy features of the method is that it does not need a
batch initialization phase, but learns the background model
from corrupted streaming video. This has the advantage that
no camera frames need to be stored at any time during
operation.

pROST should be considered a basic building block for 
larger, more sophisticated systems for background subtrac- 
tion. Future work will include the extension with a shadow 
detection mechanism and more sophisticated pre- and post- 
processing techniques. 

We have evaluated the algorithm on the changedetection.net
benchmark and shown that it outperforms the
conceptually similar GRASTA algorithm in many categories.
Our method is particularly suitable for videos recorded by
highly unstable cameras, ranking first in this category by a
large margin, and it can thus be considered an advancement
in this research area.